PLSC30500, Fall 2024

Part 3. Learning from random samples (part a)

Andy Eggers

Sampling & sample mean

Motivation

So far we have talked about

  • probability theory
    • random events \(A, B\)
    • random variables \(X, Y\)
    • PMF/PDF \(f(x)\) and CDF \(F(x)\)
    • joint \(f_{X, Y}(x, y)\), marginal \(f_X(x)\), conditional \(f_{Y \mid X}(y \mid x)\)
  • summarizing distributions
    • expectation \({\textrm E}[X]\) (& MSE)
    • variance \({\textrm V}[X]\)
    • covariance/correlation \(\text{Cov}[X, Y], \rho[X, Y]\)
    • conditional expectation \({\textrm E}[Y \mid X = x]\) and CEF \({\textrm E}[Y \mid X]\)
    • conditional variance \({\textrm V}[Y \mid X = x]\)

Motivation (2)

So far: Given known random process (known contents of urn), what will we observe? (Probability.)

Now: We switch to statistics – we try to figure out what is in a population (the urn) from a sample.

We will still sometimes assume we know what is in the urn so that we can evaluate our procedures.

Sampling example

Suppose we want to measure a characteristic of a large population (e.g. average concern about climate change in US on 0-4 scale).

We contact a sample of size \(n = 1000\).

Let \(X_i\) denote the response of the \(i\)th person we contact (so we have \(X_1, X_2, \ldots, X_n\)).

Is \(X_1\) a random variable? What is its PMF/PDF? And what about \(X_2, \ldots, X_n\)?

A & M call the PMF of a RV sampled from a population the finite population mass function \(f_{FP}(x)\).

IID: key concept in statistics

Our sample \(X_1, X_2, \ldots, X_n\) is independent and identically distributed (IID) if

  • each of \(X_1, X_2, \ldots, X_n\) is drawn from the same distribution (identically distributed)
  • \(X_1, X_2, \ldots, X_n\) are mutually independent, i.e. every subset of them is jointly independent (independent)

Then \(X_1, X_2, \ldots, X_n\) can be thought of as \(n\) samples from a single RV \(X\).

Are these sampling approaches IID?

  • sample \(n\) people from the US census with replacement
  • sample \(n\) people from the US census without replacement
  • sample \(1\) person from the US census, then interview a randomly selected friend of that person, then a friend of that person, until \(n\) (snowball sampling)

IID is an approximation that lets us treat \(X_1, X_2, \ldots, X_n\) as repeated samples from the same RV \(X\). We will use it.

Sample statistic

Definition 3.2.1 Sample statistic

For IID random variables \(X_1, X_2, \ldots, X_n\), a sample statistic \(T_{(n)}\) is a function of \(X_1, X_2, \ldots, X_n\):

\[T_{(n)} = h_{(n)}(X_1, X_2, \ldots, X_n)\]

where \(h_{(n)}: \mathbb{R}^n \rightarrow \mathbb{R}, \forall n \in \mathbb{N}\).

Examples of sample statistics: sample mean, sample variance, sample covariance, regression coefficient

Because sample statistics are functions of random variables, they are themselves random variables (cf the population mean, which is a fixed number)

Sample mean

For i.i.d. random variables \(X_1, X_2, \ldots, X_n\), the sample mean is

\[\overline X = \frac{X_1 + X_2 + \ldots + X_n}{n} = \frac{1}{n} \sum_{i = 1}^{n} X_i\]

\(\overline{X}\) is a RV (and a sample statistic). Let’s summarize its distribution!

Expectation of the sample mean

Proof that \(E[\overline{X}] = E[X]\) (Theorem 3.2.3):

\[\begin{align} {\textrm E}[\overline{X}] &= {\textrm E}\left[\frac{1}{n}(X_1 + X_2 + \ldots + X_n) \right] \\ &= \frac{1}{n} {\textrm E}\left[X_1 + X_2 + \ldots + X_n \right] \\ &= \frac{1}{n} \left( {\textrm E}[X_1] + {\textrm E}[X_2] + \ldots + {\textrm E}[X_n] \right) \\ &= \frac{1}{n} \left( n {\textrm E}[X] \right) = {\textrm E}[X] \end{align}\]

Illustration in R

set.seed(122)
xs <- c(1,2,3)
probs <- c(.2, .5, .3)
# E[X], analytically
sum(xs*probs)
[1] 2.1
# E[X], numerically: mean of a large sample from f(x)
mean(sample(x = xs, size = 10000, replace = T, prob = probs))
[1] 2.0942
# E[\overline{X}], numerically: mean of many sample means 
storage <- rep(NA, 10000)
for(i in 1:length(storage)){
  storage[i] <- mean(sample(x = xs, size = 10, replace = T, prob = probs))
}
mean(storage)
[1] 2.10102

Sampling variance of the sample mean

Okay, so \({\textrm E}[\overline{X}] = {\textrm E}[X]\). What else can we say about its distribution?

How close will \(\overline{X}\) be to \({\textrm E}[X]\)?

One measure of potential (in)accuracy is \({\textrm V}[\overline{X}]\).

Theorem 3.2.4 says \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}.\) (See homework.)

What does this mean?

Weak law of large numbers (WLLN)

If \({\textrm E}[\overline{X}] = {\textrm E}[X]\), and \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\), then with large \(n\) isn’t \(\overline{X}\) likely to give us something very close to \({\textrm E}[X]\)?

Yes! That’s what the weak law of large numbers says.

Theorem 3.2.8 Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite variance \(\text{V}[X] > 0\), and let \(\overline{X}_{(n)} = \frac{1}{n} \sum_{i=1}^n X_i\). Then

\[\overline{X}_{(n)} \overset{p}{\to} \text{E}[X]\]

(where \(\overset{p}{\to}\) means “convergence in probability”, as \(n\) increases)

Usually you don’t know \({\textrm E}[X]\), but WLLN tells us that if \(n\) is large then \(\overline{X}\) is probably close to it.

WLLN illustration (0)

Bernoulli random variable, \(p = 2/3\).

If we take a large sample (e.g. \(n = 1000\)), we can show the sample mean at each value of \(n\) from \(20, 21, \ldots, 1000\).

On the next slides we show the results of doing this once, ten times, etc:
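A minimal sketch of this simulation in R, showing one realization of the running sample mean (an assumed implementation, not the slides' own code):

```r
# one sample of Bernoulli(2/3) draws; running sample mean at each n
set.seed(122)
p <- 2/3                  # Bernoulli parameter, so E[X] = 2/3
n_max <- 1000
x <- rbinom(n_max, size = 1, prob = p)
running_means <- cumsum(x) / seq_len(n_max)  # sample mean at n = 1, ..., 1000
ns <- 20:n_max
plot(ns, running_means[ns], type = "l",
     xlab = "n", ylab = "Sample mean", ylim = c(0, 1))
abline(h = p, lty = 2)    # E[X]
```

Repeating the whole block (with different seeds) overlays more paths, as on the following slides.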

WLLN illustration (1)

WLLN illustration (2)

WLLN illustration (3)

WLLN illustration (4)

WLLN illustration (5)

\({\textrm E}[\overline{X}] = {\textrm E}[X]\) illustration

\({\textrm V}[\overline{X}] = {\textrm V}[X]/n\) illustration
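A sketch of how this could be checked by simulation, reusing the PMF from the earlier R illustration (the specific \(n\) and number of replications are assumptions of this sketch):

```r
# check V[Xbar] = V[X]/n by simulating many sample means
set.seed(122)
xs <- c(1, 2, 3)
probs <- c(.2, .5, .3)
vx <- sum(xs^2 * probs) - sum(xs * probs)^2   # V[X] = E[X^2] - E[X]^2 = 0.49
n <- 10
means <- replicate(10000, mean(sample(xs, size = n, replace = TRUE, prob = probs)))
var(means)   # simulated V[Xbar]: should be close to vx/n = 0.049
vx / n       # theoretical value
```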

Gambler’s fallacy (optional)

“WLLN says ‘With more \(n\), \(\overline{X}\) should get closer and closer to \({\textrm E}[X]\).’ The roulette ball hasn’t landed on red in a while. By the WLLN, I know that the ball is now especially likely to land on red.”

What is the gambler missing?

The plug-in principle and the sample variance

Motivation

Estimating population means is boring. When do we get to the good stuff? Prediction, causal inference, regression, machine learning, deep learning, etc.

But remember that

  • the conditional expectation function (CEF) is the best simple summary of one RV using others (based on MSE)
  • the best linear predictor (BLP) of the CEF is a function of means, variances, and covariances
  • means, variances, and covariances can all be expressed as population expectations (\({\textrm E}[Y], {\textrm E}[X], {\textrm E}[X^2], {\textrm E}[XY]\))
  • population expectations can be approximated by sample means (\(\overline{Y}, \overline{X}, \overline{X^2}, \overline{XY}\)), which get closer to the target with larger \(n\)

So it really is all about sample means! This is the “plug-in principle”.

Estimation theory terminology (1)

  • Estimand \(\theta\): what we want to estimate
  • Estimator \(\hat{\theta}\): a function of the sample \(h(X_1, X_2, \ldots, X_n)\) (and therefore an RV) that we use to estimate \(\theta\)
  • Estimate (noun): the value of \(\hat{\theta}\) for a given sample

Which one is \({\textrm E}[X]\)?

Which one is \(\overline{X}\)?

Estimation theory terminology (2)

Sampling distribution of an estimator: The distribution of \(\hat{\theta}\) (over repeated samples), as summarized by PMF/PDF \(f(\hat{\theta})\) or CDF \(F(\hat{\theta})\)

Bias of an estimator: The bias of an estimator \(\hat{\theta}\) is \({\textrm E}[\hat{\theta}] - \theta\).

If \({\textrm E}[\hat{\theta}] = \theta\), \(\hat{\theta}\) is unbiased.

Sampling variance of an estimator: The sampling variance of an estimator \(\hat{\theta}\) is \({\textrm V}[\hat{\theta}]\).

Standard error of an estimator: The standard error of an estimator \(\hat{\theta}\) is \(\sigma[\hat{\theta}] = \sqrt{{\textrm V}[\hat{\theta}]}\).

Plug-in principle (again)

Plug-in principle: “Write down the feature of the population that we are interested in, and then use the sample analog to estimate it” (A&M, page 116)

For example,

  • define estimand in terms of population expectations, and
  • turn it into an estimator by replacing population expectations by sample means.

Plug-in principle: application (1)

Above we established that \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\). (Remember what this means?)

But we never know \({\textrm V}[X]\), the population variance of \(X\).

So how can we estimate \({\textrm V}[X]\) from the sample? Plug-in principle!


Estimand in terms of expectations: \[{\textrm V}[X] = {\textrm E}[X^2] - {\textrm E}[X]^2\]

Estimator in terms of sample means: \[\hat{\text{V}}_{\text{plug-in}}[X] = \overline{X^2} - \overline{X}^2\]

Could also do \(\overline{(X - \overline{X})^2}\).

Plug-in principle: application (2)

Our plug-in sample variance estimator: \(\hat{\text{V}}_{\text{plug-in}}[X] = \overline{X^2} - \overline{X}^2\)

Suppose our sample is samp below:

samp <- c(1,5,2,6,3,4,2)

How would we compute the plug-in sample variance?

mean(samp^2) - mean(samp)^2
[1] 2.77551

Or:

mean((samp - mean(samp))^2)
[1] 2.77551

Plug-in sample variance: biased!

R’s var() function gives us a different answer:

mean(samp^2) - mean(samp)^2 # our plug-in sample variance estimator
[1] 2.77551
var(samp)                # R's var() function
[1] 3.238095

Why? The plug-in sample variance estimator is biased (especially in small samples), and R’s var() function corrects for this.

Plug-in sample variance: why biased?

Why is it biased?

  • In each sample, the plug-in sample variance is the average squared difference from the sample mean
  • In a “weird” sample, the sample mean is pulled too high/low, and squared deviations from the sample mean are smaller than squared deviations from \({\textrm E}[X]\)
  • This doesn’t tend to cancel out: “weird-high” and “weird-low” samples both give answers that are too small (cf the sample mean, where errors in opposite directions do cancel)

Example

Suppose Bernoulli RV \(X\) with \(p = 1/2\).

What is \({\textrm V}[X]\)?

Taking samples of size \(n = 2\), the possible samples are:

\((x_1, x_2)\) | \(\overline{x}\) | \(f(\overline{x})\) | \(\hat{V}_{\text{plug-in}}[X] = \overline{(x - \overline{x})^2}\)
(0, 0) | 0 | 1/4 | 0
(0, 1) or (1, 0) | 1/2 | 1/2 | 1/4
(1, 1) | 1 | 1/4 | 0

So \({\textrm E}\left[\hat{V}_{\text{plug-in}}[X]\right] = 0 \times 1/2 + 1/4 \times 1/2 = 1/8 < {\textrm V}[X] = 1/4\).
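This expectation can be verified by brute-force enumeration in R (a small check written for these slides):

```r
# enumerate all 4 equally likely samples of size 2 from Bernoulli(1/2)
samples <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
# plug-in sample variance in each sample
plug_in_var <- apply(samples, 1, function(s) mean((s - mean(s))^2))
mean(plug_in_var)   # E[plug-in variance] = 1/8, below V[X] = 1/4
```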

Plug-in sample variance: how biased?

To see how biased (and how to correct):

\[\begin{align} {\textrm E}\left[\hat{\text{V}}_{\text{plug-in}}[X]\right] &= {\textrm E}[\overline{X^2} - \overline{X}^2] \\ &= {\textrm E}[\overline{X^2}] - {\textrm E}[\overline{X}^2] \\ &= {\textrm E}[\overline{X^2}] - \left({\textrm E}[\overline{X}]^2 + {\textrm V}[\overline{X}]\right) \tag{*Def} \\ &= {\textrm E}[X^2] - {\textrm E}[X]^2 - \frac{{\textrm V}[X]}{n} \\ &= \overbrace{{\textrm V}[X]}^{\text{target}} - \overbrace{\frac{{\textrm V}[X]}{n}}^{\text{variance of }\overline{X}} \\ &= \frac{n - 1}{n} {\textrm V}[X] \end{align}\]

*Def: \({\textrm V}[\overline{X}] = {\textrm E}[\overline{X}^2] - {\textrm E}[\overline{X}]^2\)

Illustration by simulation (optional)

Plan:

  • define RV \(X\)
  • draw \(m\) samples of size \(n\), compute plug-in sample variance in each one
  • compare average estimate to \({\textrm V}[X]\)

Simulation (optional)
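A sketch of the plan above, with assumed choices (\(X \sim \text{Bernoulli}(1/2)\), \(m = 10000\) samples of size \(n = 10\)):

```r
# compare the average plug-in variance estimate to V[X] = 1/4
set.seed(122)
m <- 10000
n <- 10
plug_in_ests <- replicate(m, {
  x <- rbinom(n, size = 1, prob = 1/2)
  mean(x^2) - mean(x)^2        # plug-in sample variance
})
mean(plug_in_ests)   # close to (n-1)/n * V[X] = 0.225, not V[X] = 0.25
(n - 1) / n * 1/4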

Variances to keep straight

  • \({\textrm V}[X]\): the (population) variance of \(X\)
  • \({\textrm V}[\overline{X}]\): the (sampling) variance of the sample mean (\(= \frac{{\textrm V}[X]}{n}\))
  • \(\hat{{\textrm V}}_{\text{plug-in}}[X]\): the plug-in sample variance (\(\overline{X^2} - \overline{X}^2\), a biased estimator of \({\textrm V}[X]\))
  • \(\hat{{\textrm V}}[X]\): the sample variance (\(\frac{n}{n-1}\hat{\text{V}}_{\text{plug-in}}[X]\), an unbiased estimator of \({\textrm V}[X]\))

Common situation:

  • we have a sample of size \(n\): \(\, x_1, x_2, \ldots, x_n\)
  • we report the sample mean \(\overline{X} = \frac{x_1 + x_2 + \ldots + x_n}{n}\)
  • we want to report the variance of the sample mean, \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\), but we don’t know \({\textrm V}[X]\)
  • we instead report \(\hat{{\textrm V}}[X]/n\), an estimator of the variance of the sample mean
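For the samp vector from the earlier slides, this common situation looks like (a minimal sketch):

```r
# estimated standard error of the sample mean
samp <- c(1, 5, 2, 6, 3, 4, 2)
n <- length(samp)
mean(samp)                 # sample mean
var(samp) / n              # estimated V[Xbar]; var() uses the unbiased n-1 denominator
sqrt(var(samp) / n)        # estimated standard error of the sample mean
```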

Plug-in principle wrap-up (for now)

  • Many estimands can be expressed in terms of population expectations (e.g. \({\textrm E}[X], {\textrm E}[XY]\))
  • Sample means (e.g. \(\overline{X}, \overline{XY}\)) are good approximations of population expectations
  • “Plug in” the sample means and you have a plug-in estimator

We’ll do this again!

Estimand/estimator wrap-up

Estimand | Estimator | Biased?
Pop. mean, \({\textrm E}[X]\) | \(\overline{X}\) | \({\textrm E}[\overline{X}] = {\textrm E}[X]\)
Pop. variance, \({\textrm V}[X]\) | \(\hat{{\textrm V}}_{\text{plug-in}}[X]\) | \({\textrm E}\left[\hat{{\textrm V}}_{\text{plug-in}}[X]\right] = \frac{n-1}{n} {\textrm V}[X]\)
Pop. variance, \({\textrm V}[X]\) | \(\hat{{\textrm V}}[X]\) | \({\textrm E}[\hat{{\textrm V}}[X]] = {\textrm V}[X]\)
(Sampling) var. of sample mean, \({\textrm V}[\overline{X}]\) | \(\frac{\hat{{\textrm V}}[X]}{n}\) | \({\textrm E}\left[\frac{\hat{{\textrm V}}[X]}{n}\right] = {\textrm V}[\overline{X}]\)

Central limit theorem

Recap/motivation

What do we know about the sample mean \(\overline{X}\) so far?

  • \({\textrm E}[\overline{X}] = {\textrm E}[X]\) (unbiased)
  • \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\) (variance depends predictably on \({\textrm V}[X]\) and \(n\))
  • \(\overline{X}_{(n)} \overset{p}{\to} {\textrm E}[X]\) (WLLN)

Can we say more about \(\overline{X}\)’s sampling distribution?

For example, what is \(\text{Pr}\left[\overline{X} - {\textrm E}[X] > c\right]\) for some \(c\)?

Repeated sample means

Consider Bernoulli random variable \(X\):

\[ f(x) = \begin{cases} 1/2 & x = 0 \\ 1/2 & x = 1 \\ 0 & \text{otherwise} \end{cases} \] (Equivalently, large population with equal number of 1s and 0s.)

If we draw 10,000 samples of size \(n\) and record the sample mean each time, what will the distribution of these sample means look like? (The sampling distribution of the sample mean.)

Case 1: \(n = 2\)

Case 2: \(n = 3\)

Case 3: \(n = 5\)

Case 4: \(n = 1000\)


Central limit theorem

Theorem 3.2.24 Central limit theorem

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite \({\textrm E}[X] = \mu\) and finite \({\textrm V}[X] = \sigma^2 > 0\). Then

\[ \overline{X} \overset{d}{\to} N\left(\mu, \frac{\sigma^2}{n}\right),\] i.e. the normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).

You already know two parts of this, which are true for any sample size:

  • \({\textrm E}[\overline{X}] = {\textrm E}[X] = \mu\)
  • \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n} = \frac{\sigma^2}{n}\)

The new part is the shape: as \(n\) goes to \(\infty\), the sampling distribution of \(\overline{X}_{(n)}\) becomes more normal.

Central limit theorem (two forms)

Theorem 3.2.24 Central limit theorem

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with finite \({\textrm E}[X] = \mu\) and finite \({\textrm V}[X] = \sigma^2 > 0\). Then

\[\begin{align} \frac{\sqrt{n} \left(\overline{X} - \mu\right)}{\sigma} &\overset{d}{\to} N(0, 1) \tag{Version 1} \\ \overline{X} - \mu &\overset{d}{\to} \frac{\sigma}{\sqrt{n}} N(0, 1) \\ \overline{X} - \mu &\overset{d}{\to} N\left(0, \frac{\sigma^2}{n} \right) \\ \overline{X} &\overset{d}{\to} N\left(\mu, \frac{\sigma^2}{n} \right) \tag{Version 2} \end{align}\]

Illustration using above example
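A sketch of such an illustration in R for the Bernoulli example with \(p = 1/2\) and \(n = 1000\) (assumed simulation settings); the overlaid curve is the CLT's normal approximation, which also answers questions like the earlier \(\text{Pr}\left[\overline{X} - {\textrm E}[X] > c\right]\):

```r
# sampling distribution of Xbar with the CLT's normal overlay
set.seed(122)
n <- 1000
means <- replicate(10000, mean(rbinom(n, size = 1, prob = 1/2)))
hist(means, breaks = 50, freq = FALSE, xlab = "Sample mean",
     main = "Sampling distribution of the sample mean")
curve(dnorm(x, mean = 1/2, sd = sqrt((1/4) / n)), add = TRUE)  # N(mu, sigma^2/n)
# e.g. Pr[Xbar - E[X] > 0.02] under the normal approximation
pnorm(0.02, mean = 0, sd = sqrt((1/4) / n), lower.tail = FALSE)
```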

CLT intuition for Bernoulli RV

Let \(X\) be Bernoulli random variable (e.g. coin flip)

Suppose \(n = 4\). How many ways are there to get a sample mean \(\overline{X}\) of

  • 0
  • 1/4
  • 1/2
  • 3/4
  • 4/4

CLT intuition for Bernoulli RV (2)

Generally, how many ways to get \(k\) successes in \(n\) trials?

\[{n \choose k} = \frac{n!}{k!(n - k)!} \]

In R:

choose(n = 1000, k = 1000)
[1] 1
choose(n = 1000, k = 500)
[1] 2.702882e+299

CLT intuition for Bernoulli RV (3)

Let’s compute the number of ways to get each number of heads between 0 and 1000 in 1000 tries:

ks <- 0:1000 # number of heads
n <- 1000 # sample size

nways <- rep(NA, 1001) # storage for for-loop
for(i in 1:length(ks)){
  # number of ways to get k successes in n trials
  nways[i] <- choose(k = ks[i], n = n) 
}
head(nways, 4)
[1]         1      1000    499500 166167000

CLT intuition for Bernoulli RV (4)

Since each sequence of flips is equally likely, we can convert “number of ways” into “probability” by dividing by the total number of possible sequences.

plot(ks/1000, nways/sum(nways), pch = 19, cex = .25, 
     xlab = "Proportion of heads in 1000 tries", ylab = "Probability of observing this proportion")

CLT intuition for Bernoulli RV (5)

CLT intuition more broadly

For Bernoulli \(X\) (a very “un-normal” PMF!), \(\overline{X}\) is approximately normally distributed (with large \(n\)) because there are many more ways to get a sample mean close to \({\textrm E}[X]\).

Extend that intuition to other \(X\)’s:

  • there are many more ways to get a sample mean close to \({\textrm E}[X]\) than one far away, so
  • you are much more likely to get a sample mean close to \({\textrm E}[X]\) than one far away, and
  • the normal distribution tells you how much more likely.

Galton board

Source: https://www.youtube.com/watch?v=EvHiee7gs9Y

CLT: big enough sample?

CLT: big enough sample?

CLT: big enough sample?

CLT: big enough sample?

CLT: not just for sample means

We focused on the sample mean, but (given “mild regularity conditions”) all plug-in estimators are asymptotically normal (Theorem 3.3.6).

(“Mild regularity conditions” means that small changes in the CDF produce small changes in our sample statistic (technically, a statistical functional).)


Intuition:

  • all estimands can be represented as functions of CDF; all plug-in estimators can be described as functions of the empirical CDF
  • across samples, empirical CDF (in sample) will resemble population CDF with mostly small discrepancies, some larger discrepancies
  • we get normal sampling distribution of \(\hat{\theta}\) if small discrepancies in CDF produce small deviations \(\hat{\theta} - \theta\), large discrep. produce large \(\hat{\theta} - \theta\)
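For instance, a sketch showing that the plug-in variance estimator itself looks approximately normal across samples (the die-roll RV, \(n\), and replication count are illustrative choices, not from the slides):

```r
# sampling distribution of the plug-in variance estimator
set.seed(122)
n <- 500
ests <- replicate(10000, {
  x <- sample(1:6, size = n, replace = TRUE)   # uniform die rolls
  mean(x^2) - mean(x)^2                        # plug-in variance estimate
})
hist(ests, breaks = 50, freq = FALSE, xlab = "Plug-in variance estimate",
     main = "Approximately normal sampling distribution")
```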

When the CLT does not apply

Some practically irrelevant exceptions implied by the CLT theorem statement:

Sample mean of \(X\) when

  • \({\textrm E}[X] = \infty\), e.g. the St Petersburg paradox
  • \({\textrm V}[X] = 0\)
  • non-iid sampling that eventually leads to sampling one person over and over

More practically relevant exceptions:

  • mode of a discrete RV
  • median or other quantile of a discrete RV

No CLT here!

No CLT here either!